209 research outputs found

    States versus Rewards: Dissociable Neural Prediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning

    Reinforcement learning (RL) uses sequential experience with situations (“states”) and outcomes to assess actions. Whereas model-free RL uses this experience directly, in the form of a reward prediction error (RPE), model-based RL uses it indirectly, building a model of the state transition and outcome structure of the environment and evaluating actions by searching this model. A state prediction error (SPE) plays a central role, reporting discrepancies between the current model and the observed state transitions. Using functional magnetic resonance imaging in humans solving a probabilistic Markov decision task, we found the neural signature of an SPE in the intraparietal sulcus and lateral prefrontal cortex, in addition to the previously well-characterized RPE in the ventral striatum. This finding supports the existence of two distinct forms of learning signal in humans, which may form the basis of distinct computational strategies for guiding behavior.
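    The two learning signals described above can be sketched in a few lines. This is a minimal illustration, not the paper's analysis code: the function names, learning rates, and tabular representations are assumptions.

```python
import numpy as np

def model_free_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """Model-free TD(0): a reward prediction error (RPE) drives value updates."""
    rpe = r + gamma * V[s_next] - V[s]   # discrepancy between received and expected reward
    V[s] += alpha * rpe
    return rpe

def model_based_update(T, s, a, s_next, alpha=0.1):
    """Model-based learning: a state prediction error (SPE) updates the transition model."""
    spe = 1.0 - T[s, a, s_next]          # surprise about the observed transition
    T[s, a] *= (1 - alpha)               # decay all transition estimates for (s, a)...
    T[s, a, s_next] += alpha             # ...and shift probability mass to the observed state
    return spe
```

    The RPE compares outcomes against cached values, while the SPE compares observed transitions against the learned world model; the update preserves each transition row as a probability distribution.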

    Anterior Prefrontal Cortex Contributes to Action Selection through Tracking of Recent Reward Trends

    The functions of prefrontal cortex remain enigmatic, especially for its anterior sectors, putatively ranging from planning to self-initiated behavior, social cognition, task switching, and memory. A predominant current theory regarding the most anterior sector, the frontopolar cortex (FPC), is that it is involved in exploring alternative courses of action, but the detailed causal mechanisms remain unknown. Here we investigated this issue using the lesion method, together with a novel model-based analysis. Eight patients with anterior prefrontal brain lesions, including the FPC, performed a four-armed bandit task known from neuroimaging studies to activate the FPC. Model-based analyses of learning demonstrated a selective deficit in the ability to extrapolate the most recent trend, despite an intact general ability to learn from past rewards. Whereas both brain-damaged and healthy controls used comparisons between the two most recent choice outcomes to infer trends that influenced their decision about the next choice, the group with anterior prefrontal lesions showed a complete absence of this component and instead based their choice entirely on the cumulative reward history. Given that the FPC is thought to be the most evolutionarily recent expansion of primate prefrontal cortex, we suggest that its function may reflect uniquely human adaptations to select and update models of reward contingency in dynamic environments.
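    The dissociation reported above, between cumulative reward history and extrapolation of the latest trend, can be caricatured as two additive value components. This toy decomposition and its weighting parameter are hypothetical, not the study's fitted model.

```python
def predicted_value(outcomes, w_trend=0.5):
    """Toy value model: cumulative reward history plus recent-trend extrapolation.

    Healthy controls behave as if w_trend > 0; the anterior prefrontal lesion
    group behaves as if w_trend = 0 (history-only). Weights are illustrative.
    """
    history = sum(outcomes) / len(outcomes)                            # cumulative reward history
    trend = outcomes[-1] - outcomes[-2] if len(outcomes) >= 2 else 0.0 # latest outcome comparison
    return history + w_trend * trend
```

    Setting `w_trend` to zero reproduces a chooser driven purely by accumulated rewards, the pattern the lesion group showed.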

    Independent neural computation of value from other people's confidence

    Expectation of reward can be shaped by observing the actions and expressions of other people in one's environment. A person's apparent confidence in the likely reward of an action, for instance, makes qualities of their evidence, not observed directly, socially accessible. This strategy is computationally distinguished from associative learning methods that rely on direct observation by its use of inference from indirect evidence. In twenty-three healthy human subjects, we isolated the effects of first-hand experience, other people's choices, and the mediating effect of their confidence on decision-making and on neural correlates of value within ventromedial prefrontal cortex (vmPFC). Value derived from first-hand experience and from other people's choices (regardless of confidence) was indiscriminately represented across vmPFC. However, value computed from agents' choices weighted by their associated confidence was represented with specificity for ventromedial area 10. This pattern corresponds to shifts of connectivity and overlapping cognitive processes along a posterior-anterior vmPFC axis. Task behavior correlated with self-reported self-reliance in decision-making in other social contexts. The tendency to conform in other social contexts corresponded to increased activation in cortical regions previously shown to respond to social conflict in proportion to subsequent conformity (Campbell-Meiklejohn et al., 2010). The tendency to self-monitor predicted a selectively enhanced response to accordance with others in the right temporoparietal junction (rTPJ). The findings anatomically decompose vmPFC value representations according to computational requirements and provide biological insight into the social transmission of preference and the reassurance gained from the confidence of others. Significance Statement: Decades of research have provided evidence that the ventromedial prefrontal cortex (vmPFC) signals the satisfaction we expect from imminent actions. However, we have a surprisingly modest understanding of the organization of value across this substantial and varied region. This study finds that using cues of the reliability of other people's knowledge to enhance expectation of personal success generates value correlates that are anatomically distinct from those concurrently computed from direct, personal experience. This suggests that the representation of decision values in vmPFC is suborganized according to the underlying computation, consistent with what we know about the anatomical heterogeneity of the region. These results also provide insight into the observational learning process by which someone else's confidence can sway and reassure our choices.

    How cognitive and reactive fear circuits optimize escape decisions in humans

    Flight initiation distance (FID), the distance at which an organism flees from an approaching threat, is an ecological metric of cost–benefit functions of escape decisions. We adapted the FID paradigm to investigate how fast- or slow-attacking “virtual predators” constrain escape decisions. We show that rapid escape decisions rely on “reactive fear” circuits in the periaqueductal gray and midcingulate cortex (MCC), while protracted escape decisions, defined by larger buffer zones, are associated with “cognitive fear” circuits, which include the posterior cingulate cortex, hippocampus, and ventromedial prefrontal cortex, circuits implicated in more complex information processing, cognitive avoidance strategies, and behavioral flexibility. Using a Bayesian decision-making model, we further show that optimization of escape decisions under rapid flight was localized to the MCC, a region involved in adaptive motor control, whereas the hippocampus was implicated in optimizing decisions that update and control slower escape initiation. These results demonstrate an unexplored link between defensive survival circuits and their role in adaptive escape decisions.

    Model-based learning protects against forming habits.

    Studies in humans and rodents have suggested that behavior can at times be "goal-directed" (that is, planned and purposeful) and at times "habitual" (that is, inflexible and automatically evoked by stimuli). This distinction is central to conceptions of pathological compulsion, as in drug abuse and obsessive-compulsive disorder. Evidence for the distinction has primarily come from outcome-devaluation studies, in which the sensitivity of a previously learned behavior to motivational change is used to assay the dominance of habits versus goal-directed actions. However, little is known about how habits and goal-directed control arise. Specifically, in the present study we sought to reveal the trial-by-trial dynamics of instrumental learning that would promote, and protect against, developing habits. In two complementary experiments with independent samples, participants completed a sequential decision task that dissociated two computational learning mechanisms, model-based and model-free. We then tested for habits by devaluing one of the rewards that had reinforced behavior. In each case, we found that individual differences in model-based learning predicted the participants' subsequent sensitivity to outcome devaluation, suggesting that an associative mechanism underlies a bias toward habit formation in healthy individuals. This work was funded by a Sir Henry Wellcome Postdoctoral Fellowship (101521/Z/12/Z) awarded to C.M.G. ND is supported by a Scholar Award from the McDonnell Foundation. The authors report no conflicts of interest and declare no competing financial interests. This is the final published version. It first appeared at http://link.springer.com/article/10.3758%2Fs13415-015-0347-6
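    A common way to formalize the balance between the two mechanisms dissociated above is a weighted mixture of model-based and model-free action values passed through a softmax. This is a hedged sketch of that standard analysis style, not the study's actual fitting code; the weight `w`, inverse temperature `beta`, and values are illustrative.

```python
import numpy as np

def choice_probs(q_mb, q_mf, w, beta=3.0):
    """Softmax choice over a weighted mix of model-based and model-free values.

    w near 1: planning dominates (devaluation-sensitive behavior);
    w near 0: cached, habit-like values dominate.
    """
    q_net = w * np.asarray(q_mb, float) + (1 - w) * np.asarray(q_mf, float)
    e = np.exp(beta * (q_net - q_net.max()))   # subtract max for numerical stability
    return e / e.sum()
```

    Individual differences in the fitted `w` are then related to devaluation sensitivity: a higher model-based weight predicts behavior that adjusts when an outcome loses value.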

    Model-based control can give rise to devaluation-insensitive choice

    Influential recent work aims to ground psychiatric dysfunction in the brain's basic computational mechanisms. For instance, the compulsive symptoms that feature prominently in drug abuse and addiction have been argued to arise from overreliance on a habitual “model-free” system, in contrast to a more laborious “model-based” system. Support for this account comes in part from failures to appropriately change behavior in light of new events. Notably, instrumental responding can, in some circumstances, persist despite reinforcer devaluation, perhaps reflecting control by model-free mechanisms that are driven by past reinforcement rather than by knowledge of the (now devalued) outcome. However, another line of theory posits a different mechanism, latent cause inference, that can modulate behavioral change. It concerns how animals identify the different contingencies that apply in different circumstances by covertly clustering experiences into distinct groups. Here we combine both lines of theory to investigate the consequences of latent cause inference for instrumental sensitivity to reinforcer devaluation. We show that instrumental insensitivity to reinforcer devaluation can arise in this theory even using only model-based planning, and does not require or imply any habitual, model-free component. These ersatz habits (like laboratory ones) emerge after overtraining, interact with contextual cues, and show preserved sensitivity to reinforcer devaluation on a separate consumption test, a standard control. Together, this work highlights the need for caution when using reinforcer devaluation procedures to rule in (or out) the contribution of different learning mechanisms and offers a new perspective on the neurocomputational substrates of drug abuse.

    Tonic Dopamine Modulates Exploitation of Reward Learning

    The impact of dopamine on adaptive behavior in a naturalistic environment is largely unexamined. Experimental work suggests that phasic dopamine is central to reinforcement learning, whereas tonic dopamine may modulate performance without altering learning per se; however, this idea has not been developed formally or integrated with computational models of dopamine function. We quantitatively evaluate the role of tonic dopamine in these functions by studying the behavior of hyperdopaminergic DAT-knockdown mice in an instrumental task in a semi-naturalistic homecage environment. In this “closed economy” paradigm, subjects earn all of their food by pressing either of two levers, but the relative cost of food on each lever shifts frequently. Compared with wild-type mice, hyperdopaminergic mice allocate more lever presses to high-cost levers, thus working harder to earn a given amount of food and maintain their body weight. However, both groups react similarly quickly to shifts in lever cost, suggesting that the hyperdopaminergic mice are not slower at detecting changes, as would be expected with a learning deficit. We fit the lever-choice data using reinforcement learning models to assess the distinction between acquisition and expression that the models formalize. In these analyses, hyperdopaminergic mice displayed normal learning from recent reward history but a diminished capacity to exploit this learning: a reduced coupling between choice and reward history. These data suggest that dopamine modulates the degree to which prior learning biases action selection and consequently alters the expression of learned, motivated behavior.
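    The acquisition/expression dissociation described above can be illustrated with a two-lever simulation in which the learning rate (acquisition) and the inverse temperature coupling values to choice (expression) are separate parameters. This is an illustrative sketch, not the study's model; all parameter values and names are assumptions.

```python
import numpy as np

def simulate(alpha, beta, p_reward, n_trials=500, seed=0):
    """Two-lever RL agent: alpha governs acquisition (learning from reward),
    beta governs expression (coupling of learned values to choice).
    Returns the fraction of choices of lever 1."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    picks = 0
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))  # softmax over two levers
        c = int(rng.random() < p1)
        r = float(rng.random() < p_reward[c])             # Bernoulli reward
        q[c] += alpha * (r - q[c])                        # identical update in both cases
        picks += c
    return picks / n_trials
```

    With `alpha` held fixed, lowering `beta` leaves value learning intact but weakens the link between learned values and choice, the pattern the hyperdopaminergic mice showed.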

    Humans decompose tasks by trading off utility and computational cost

    Human behavior emerges from planning over elaborate decompositions of tasks into goals, subgoals, and low-level actions. How are these decompositions created and used? Here, we propose and evaluate a normative framework for task decomposition based on the simple idea that people decompose tasks to reduce the overall cost of planning while maintaining task performance. Analyzing 11,117 distinct graph-structured planning tasks, we find that our framework justifies several existing heuristics for task decomposition and makes predictions that can be distinguished from two alternative normative accounts. We report a behavioral study of task decomposition (N = 806) that uses 30 randomly sampled graphs, a larger and more diverse set than that of any previous behavioral study on this topic. We find that human responses are more consistent with our framework for task decomposition than with the alternative normative accounts, and are most consistent with a heuristic, betweenness centrality, that is justified by our approach. Taken together, our results provide new theoretical insight into the computational principles underlying the intelligent structuring of goal-directed behavior.
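    The betweenness-centrality heuristic named above scores each node by the fraction of shortest paths that pass through it; high-scoring "bottleneck" states make natural subgoal candidates. The brute-force implementation and the toy graph below are illustrative (fine for small graphs), not the paper's pipeline.

```python
from collections import deque
from itertools import permutations

def shortest_paths(graph, s, t):
    """All shortest paths from s to t via BFS predecessor enumeration."""
    dist, preds, q = {s: 0}, {s: []}, deque([s])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v], preds[v] = dist[u] + 1, [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)      # another equally short route into v
    paths = []
    def walk(v, suffix):
        if v == s:
            paths.append([s] + suffix)
            return
        for p in preds[v]:
            walk(p, [v] + suffix)
    if t in dist:
        walk(t, [])
    return paths

def betweenness(graph):
    """Sum, over ordered node pairs, of the fraction of shortest paths
    passing through each intermediate node (endpoints excluded)."""
    bc = {v: 0.0 for v in graph}
    for s, t in permutations(graph, 2):
        paths = shortest_paths(graph, s, t)
        if not paths:
            continue
        for v in graph:
            if v not in (s, t):
                bc[v] += sum(v in p for p in paths) / len(paths)
    return bc
```

    On a graph of two clusters joined by a single node, that bridging node receives the highest score, matching the intuition that people place subgoals at bottleneck states.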